🐼: panel data, apparently. 1
Reasons to use
Reasons not to use
data/scores.csv: students’ scores in 5 subjectsdata/contacts.csv: students’ contact detailsA brief aside on a first step that tends to get overlooked.
pandas is a library for data analysisimportimport it:fromfrom command to import objects and methods directly from modules into the global namespace.asas:RR, the default approach is to import all methods from a library into the global namespace.pandas it would look like this:ArrowDtype BooleanDtype Categorical CategoricalDtype CategoricalIndex DataFrame DateOffset DatetimeIndex DatetimeTZDtype
ExcelFile ExcelWriter Flags Float32Dtype Float64Dtype Grouper HDFStore Index IndexSlice
Int16Dtype Int32Dtype Int64Dtype Int8Dtype Interval IntervalDtype IntervalIndex MultiIndex NA
NaT NamedAgg Period PeriodDtype PeriodIndex RangeIndex Series SparseDtype StringDtype
Timedelta TimedeltaIndex Timestamp UInt16Dtype UInt32Dtype UInt64Dtype UInt8Dtype api array
arrays bdate_range concat crosstab cut date_range describe_option errors eval
factorize from_dummies get_dummies get_option infer_freq interval_range io isna isnull
json_normalize lreshape melt merge merge_asof merge_ordered notna notnull offsets
ojs_define option_context options pandas pd period_range pivot pivot_table plotting
qcut read_clipboard read_csv read_excel read_feather read_fwf read_gbq read_hdf read_html
read_json read_orc read_parquet read_pickle read_sas read_spss read_sql read_sql_query read_sql_table
read_stata read_table read_xml reset_option set_eng_float_format set_option show_versions test testing
timedelta_range to_datetime to_numeric to_pickle to_timedelta tseries unique value_counts wide_to_long
NB: %-commands only work in Jupyter (IPython).
DataFramedict of equal-length lists.DataFrame| student_id | math | english | history | biology | art | |
|---|---|---|---|---|---|---|
| 0 | 5a01 | 95 | 97 | 80 | 81 | 86 |
| 1 | 5a12 | 78 | 91 | 89 | 86 | 81 |
| 2 | 5b05 | 85 | 86 | 94 | 88 | 82 |
| 3 | 5b10 | 90 | 89 | 87 | 99 | 91 |
| 4 | 5e04 | 88 | 90 | 85 | 88 | 84 |
pandas comes with functions for reading and writing to all kinds of data formats. A quick list can be viewed using tab completion:
read_clipboard() read_gbq() read_parquet() read_sql_query()
read_csv() read_hdf() read_pickle() read_sql_table()
read_excel() read_html() read_sas() read_stata()
read_feather() read_json() read_spss() read_table()
read_fwf() read_orc() read_sql() read_xml()
read_csv| student_id | math | english | history | biology | art |
|---|---|---|---|---|---|
| 5a01 | 95 | 97 | 80 | 81 | 86 |
| 5a12 | 78 | 91 | 89 | 86 | 81 |
| 5b05 | 85 | 86 | 94 | 88 | 82 |
| 5b10 | 90 | 89 | 87 | 99 | 91 |
| 5e04 | 88 | 90 | 85 | 88 | 84 |
math scores?”5a12 get on all subjects?”5a12 and 5e04 get on history and art?”| student_id | math | english | history | biology | art | |
|---|---|---|---|---|---|---|
| 0 | 5a01 | 95 | 97 | 80 | 81 | 86 |
| 1 | 5a12 | 78 | 91 | 89 | 86 | 81 |
| 2 | 5b05 | 85 | 86 | 94 | 88 | 82 |
| 3 | 5b10 | 90 | 89 | 87 | 99 | 91 |
| 4 | 5e04 | 88 | 90 | 85 | 88 | 84 |
df.columns and df.indexdf.set_index() returns dataframe with new index| student_id | math | english | history | biology | art |
|---|---|---|---|---|---|
| 5a01 | 95 | 97 | 80 | 81 | 86 |
| 5a12 | 78 | 91 | 89 | 86 | 81 |
| 5b05 | 85 | 86 | 94 | 88 | 82 |
| 5b10 | 90 | 89 | 87 | 99 | 91 |
| 5e04 | 88 | 90 | 85 | 88 | 84 |
.loc.loc for name-based indexing..loc[<INDEX>, <COLS>]
<INDEX> and <COLS> correspond to the index and column names.: to indicate “all”.5a01 get on all (:) exams?”| student_id | math | english | history | biology | art |
|---|---|---|---|---|---|
| 5a01 | 95 | 97 | 80 | 81 | 86 |
| 5a12 | 78 | 91 | 89 | 86 | 81 |
| 5b05 | 85 | 86 | 94 | 88 | 82 |
| 5b10 | 90 | 89 | 87 | 99 | 91 |
| 5e04 | 88 | 90 | 85 | 88 | 84 |
:) get on history?”| student_id | math | english | history | biology | art |
|---|---|---|---|---|---|
| 5a01 | 95 | 97 | 80 | 81 | 86 |
| 5a12 | 78 | 91 | 89 | 86 | 81 |
| 5b05 | 85 | 86 | 94 | 88 | 82 |
| 5b10 | 90 | 89 | 87 | 99 | 91 |
| 5e04 | 88 | 90 | 85 | 88 | 84 |
5a01 and 5a12 (['5a01', '5a12']) get on all (:) exams?”| student_id | math | english | history | biology | art |
|---|---|---|---|---|---|
| 5a01 | 95 | 97 | 80 | 81 | 86 |
| 5a12 | 78 | 91 | 89 | 86 | 81 |
| 5b05 | 85 | 86 | 94 | 88 | 82 |
| 5b10 | 90 | 89 | 87 | 99 | 91 |
| 5e04 | 88 | 90 | 85 | 88 | 84 |
:) get on art and history (['art', 'history'])?| student_id | math | english | history | biology | art |
|---|---|---|---|---|---|
| 5a01 | 95 | 97 | 80 | 81 | 86 |
| 5a12 | 78 | 91 | 89 | 86 | 81 |
| 5b05 | 85 | 86 | 94 | 88 | 82 |
| 5b10 | 90 | 89 | 87 | 99 | 91 |
| 5e04 | 88 | 90 | 85 | 88 | 84 |
Note the order of labels changes the order of columns:
5a01 get in history?| student_id | math | english | history | biology | art |
|---|---|---|---|---|---|
| 5a01 | 95 | 97 | 80 | 81 | 86 |
| 5a12 | 78 | 91 | 89 | 86 | 81 |
| 5b05 | 85 | 86 | 94 | 88 | 82 |
| 5b10 | 90 | 89 | 87 | 99 | 91 |
| 5e04 | 88 | 90 | 85 | 88 | 84 |
5a01 get in history and art?| student_id | math | english | history | biology | art |
|---|---|---|---|---|---|
| 5a01 | 95 | 97 | 80 | 81 | 86 |
| 5a12 | 78 | 91 | 89 | 86 | 81 |
| 5b05 | 85 | 86 | 94 | 88 | 82 |
| 5b10 | 90 | 89 | 87 | 99 | 91 |
| 5e04 | 88 | 90 | 85 | 88 | 84 |
5a01, 5a12, and 5b05 get in biology and art?| student_id | math | english | history | biology | art |
|---|---|---|---|---|---|
| 5a01 | 95 | 97 | 80 | 81 | 86 |
| 5a12 | 78 | 91 | 89 | 86 | 81 |
| 5b05 | 85 | 86 | 94 | 88 | 82 |
| 5b10 | 90 | 89 | 87 | 99 | 91 |
| 5e04 | 88 | 90 | 85 | 88 | 84 |
“The overall scores of students who got 90 or higher in math””
.loc Filtering.loc can take boolean arrays:| math | english | history | biology | art | |
|---|---|---|---|---|---|
| student_id | |||||
| 5a01 | 95 | 97 | 80 | 81 | 86 |
| 5b10 | 90 | 89 | 87 | 99 | 91 |
history scores where math and art \(>=\) 85.iloc: locational indexing:2 in Python)-3:)math score?”5b10’s maximum score?””Clarifying what I mean in this lecture when I say:
Series and DataFramepd.Series are 1-dimensionalpd.DataFrame are 2-dimensionalSeries.loc) on a DataFrame returns a SeriesDataFramelist returns a DataFrameSeries and DataFrame objects have many common operations built-in:
.ne()), \(\gt\) (.gt()), ….mean()), median (.median()), standard deviation (.std()), …math score?”apply and applymapSeries and DataFrames using their .apply method.pd.Series.apply: functions with scalar argumentspd.DataFrame.apply: functions with pd.Series as an argument
axis: 0 is row-wise, 1 is column-wisepd.DataFrame.applymap: functions with scalar argumentsaxis=0)axis=1, using built-in min):| math | english | history | biology | art | |
|---|---|---|---|---|---|
| student_id | |||||
| 5a01 | A | A | B | B | B |
| 5a12 | C | A | B | B | B |
| 5b05 | B | B | A | B | B |
| 5b10 | A | B | B | A | A |
| 5e04 | B | A | B | B | B |
| math | english | history | biology | art | |
|---|---|---|---|---|---|
| student_id | |||||
| 5b11 | 93 | 84 | 82 | 95 | 91 |
| 5c01 | 88 | 73 | 72 | 80 | 84 |
| math | english | history | biology | art | |
|---|---|---|---|---|---|
| student_id | |||||
| 5a01 | 95 | 97 | 80 | 81 | 86 |
| 5a12 | 78 | 91 | 89 | 86 | 81 |
| 5b05 | 85 | 86 | 94 | 88 | 82 |
| 5b10 | 90 | 89 | 87 | 99 | 91 |
| 5e04 | 88 | 90 | 85 | 88 | 84 |
| 5b11 | 93 | 84 | 82 | 95 | 91 |
| 5c01 | 88 | 73 | 72 | 80 | 84 |
.reset_index() moves the index into a column:| StudentID | FirstName | LastName | Age | Gender | Scholarship | ||
|---|---|---|---|---|---|---|---|
| 0 | 5a01 | Alice | Smith | 20 | Female | True | alice@gmail.com |
| 1 | 5a02 | Bob | Higgins | 21 | Male | True | bob@hotmail.com |
| 2 | 5b05 | Charlie | Wylie | 22 | Male | False | charlie@yahoo.com |
| 3 | 5b10 | David | Card | 20 | Male | False | david@gmail.com |
| 4 | 5e04 | Eva | Longman | 23 | Female | False | eva@outlook.com |
| 5 | 5b11 | Frankie | Krueger | 20 | Female | True | frankie@outlook.com |
| 6 | 6a01 | Gerald | Nivea | 19 | Male | False | gerald@gmail.com |
StudentID, in the other student_id.student_id back to the columnsleft and rightonhow is easiest to explain visuallyhow='inner'| FirstName | StudentID | student_id | history | |
|---|---|---|---|---|
| 0 | Alice | 5a01 | 5a01 | 80 |
| 1 | Charlie | 5b05 | 5b05 | 94 |
| 2 | David | 5b10 | 5b10 | 87 |
| 3 | Eva | 5e04 | 5e04 | 85 |
how='left'| FirstName | StudentID | student_id | history | |
|---|---|---|---|---|
| 0 | Alice | 5a01 | 5a01 | 80.0 |
| 1 | Bob | 5a02 | NaN | NaN |
| 2 | Charlie | 5b05 | 5b05 | 94.0 |
| 3 | David | 5b10 | 5b10 | 87.0 |
| 4 | Eva | 5e04 | 5e04 | 85.0 |
| 5 | Frankie | 5b11 | NaN | NaN |
| 6 | Gerald | 6a01 | NaN | NaN |
how='right'| FirstName | StudentID | student_id | history | |
|---|---|---|---|---|
| 0 | Alice | 5a01 | 5a01 | 80 |
| 1 | NaN | NaN | 5a12 | 89 |
| 2 | Charlie | 5b05 | 5b05 | 94 |
| 3 | David | 5b10 | 5b10 | 87 |
| 4 | Eva | 5e04 | 5e04 | 85 |
how='outer'| FirstName | StudentID | student_id | history | |
|---|---|---|---|---|
| 0 | Alice | 5a01 | 5a01 | 80.0 |
| 1 | Bob | 5a02 | NaN | NaN |
| 2 | Charlie | 5b05 | 5b05 | 94.0 |
| 3 | David | 5b10 | 5b10 | 87.0 |
| 4 | Eva | 5e04 | 5e04 | 85.0 |
| 5 | Frankie | 5b11 | NaN | NaN |
| 6 | Gerald | 6a01 | NaN | NaN |
| 7 | NaN | NaN | 5a12 | 89.0 |
| voted | tory_vote | election_interest | civic_duty | party_id | ideo_lr | ideo_pc1 | ideo_pc2 | vote_leave | class | female | edlevel | region | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 1 | Labour | 3 | -1.204655 | 1.158077 | 1 | 2 | 1 | 0 | East Midlands |
| 1 | 1 | 0 | 1 | 5 | Labour | 0 | 3.077969 | 1.322138 | 0 | 2 | 0 | 1 | East Midlands |
| 2 | 0 | 0 | 2 | 2 | Refused | 5 | -0.634762 | -4.671532 | 1 | 2 | 0 | 0 | East Midlands |
| 3 | 1 | 0 | 1 | 4 | Labour | 6 | 3.686122 | 1.205449 | 1 | 2 | 1 | 0 | East Midlands |
| 4 | 1 | 0 | 1 | 4 | Labour | 2 | 0.139499 | 1.408508 | 0 | 2 | 1 | 4 | East Midlands |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2194 entries, 0 to 2193
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 voted 2194 non-null int64
1 tory_vote 2194 non-null int64
2 election_interest 2194 non-null int64
3 civic_duty 2194 non-null int64
4 party_id 2194 non-null object
5 ideo_lr 2194 non-null int64
6 ideo_pc1 2194 non-null float64
7 ideo_pc2 2194 non-null float64
8 vote_leave 2194 non-null int64
9 class 2194 non-null int64
10 female 2194 non-null int64
11 edlevel 2194 non-null int64
12 region 2194 non-null object
dtypes: float64(2), int64(9), object(2)
memory usage: 223.0+ KB
pd.Series.value_counts() for tabulationpandasThe following sections of Python for Data Analysis: Data Wrangling with Pandas, NumPy and IPython, 2nd edition are relevant to this lecture:
pandas guide to combining dataframes